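We start with pronunciation dictionary processing. First we need to prepare a pronunciation dictionary (lexicon) in the format shown below. Each line has a word on the left and its phonetic transcription on the right. The transcription depends on the phonetic system used (e.g. Formosa Phonetic Alphabet (ForPA), Hanyu Pinyin, or the International Phonetic Alphabet (IPA)).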
#lexicon.txt
一兩百元 i:1 l j A: N3 p aI3 H A: n2
一兩百年 i:1 l j A: N3 p aI3 n j A: n2
一兩百萬 i:1 l j A: N3 p aI3 w A: n4
一兩萬 i:1 l j A: N3 w A: n4
一兩萬元 i:1 l j A: N3 w A: n4 H A: n2
一兩週 i:1 l j A: N3 ttss oU1
一兩隻 i:1 l j A: N3 ttss1
一兩項 i:1 l j A: N3 s6 j A: N4
一兩顆 i:1 l j A: N3 k_h ax1
一兩點 i:4 l j A: N3 t j A: n3
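Because every lexicon entry must pair a word with at least one phone, a quick format check can catch broken lines before the dictionary is processed. The following is a minimal sketch, not part of the original recipe; the temporary file and the third (deliberately broken) entry are made up for illustration.

```shell
#!/bin/bash
# Sketch: sanity-check lexicon entries (the sample data here is hypothetical).
lexicon=$(mktemp)
cat > "$lexicon" <<'EOF'
一兩萬 i:1 l j A: N3 w A: n4
一兩週 i:1 l j A: N3 ttss oU1
壞掉的詞
EOF

# A valid entry is a word followed by at least one phone, so any line
# with fewer than two whitespace-separated fields is reported.
bad_lines=$(awk 'NF < 2 {print NR}' "$lexicon")
echo "lines without a pronunciation: $bad_lines"
```

Pointing `lexicon` at the real lexicon.txt gives the same check on actual data.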
We can use the script local/prepare_dict.sh to process the lexicon:
#local/prepare_dict.sh
source_dir=<lexicon.txt path>
dict_dir=data/local/dict
rm -rf $dict_dir
mkdir -p $dict_dir
rm -f $dict_dir/lexicon.txt
touch $dict_dir/lexicon.txt
cat $source_dir/lexicon.txt > $dict_dir/lexicon.txt
echo "<SIL> SIL" >> $dict_dir/lexicon.txt
#
# define silence phone
#
rm -f $dict_dir/silence_phones.txt
touch $dict_dir/silence_phones.txt
echo "SIL" > $dict_dir/silence_phones.txt
#
# find nonsilence phones
#
rm -f $dict_dir/nonsilence_phones.txt
touch $dict_dir/nonsilence_phones.txt
cat $source_dir/lexicon.txt | grep -v -F -f $dict_dir/silence_phones.txt | \
perl -ane 'print join("\n", @F[1..$#F]) . "\n"; ' | \
sort -u > $dict_dir/nonsilence_phones.txt
#
# add optional silence phones
#
rm -f $dict_dir/optional_silence.txt
touch $dict_dir/optional_silence.txt
echo "SIL" > $dict_dir/optional_silence.txt
#
# extra questions
#
rm -f $dict_dir/extra_questions.txt
touch $dict_dir/extra_questions.txt
cat $dict_dir/silence_phones.txt | awk '{printf("%s ", $1);} END{printf "\n";}' > $dict_dir/extra_questions.txt || exit 1;
cat $dict_dir/nonsilence_phones.txt | awk '{printf("%s ", $1);} END{printf "\n";}' >> $dict_dir/extra_questions.txt || exit 1;
echo "Dictionary preparation succeeded"
exit 0;
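Running this script creates four files under data/local/dict: lexicon.txt, optional_silence.txt, nonsilence_phones.txt, and silence_phones.txt. The last step of the script also writes extra_questions.txt to the same directory.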
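To see what ends up in nonsilence_phones.txt, the phone extraction step can be reproduced on a two-entry lexicon. This is only a sketch: it uses awk in place of the script's Perl one-liner, and the temporary directory is an assumption made for the example.

```shell
#!/bin/bash
# Sketch: collect the unique phone inventory from a tiny lexicon.
tmp=$(mktemp -d)
cat > "$tmp/lexicon.txt" <<'EOF'
一兩萬 i:1 l j A: N3 w A: n4
一兩週 i:1 l j A: N3 ttss oU1
EOF

# Field 1 is the word; every remaining field is a phone.
phones=$(awk '{for (i = 2; i <= NF; i++) print $i}' "$tmp/lexicon.txt" | LC_ALL=C sort -u)
echo "$phones"
```

Each phone appears once in the output, no matter how many words use it. In a Kaldi checkout, utils/validate_dict_dir.pl data/local/dict can then be used to check the finished directory.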
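Next comes the dataset (corpus). Before it can be used to train a model in Kaldi, the data has to be prepared as a set of files in Kaldi's format. The following three are the most important: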
wav.scp: <utterance-id> <wav-file-path>
utt2spk: <utterance-id> <speaker-id>
text: <utterance-id> <transcription>
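These three files can be generated with local/prepare_data.sh. Other files, such as spk2utt, can be derived directly with a tool that ships with Kaldi:
utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
Whether spk2gender is needed depends on whether gender information is used during training.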
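To illustrate what the utt2spk to spk2utt conversion does, here is a small awk sketch that groups utterance IDs by speaker. It is not the Kaldi script itself, and the speaker and utterance IDs are invented for the example.

```shell
#!/bin/bash
# Sketch: group utterance IDs by speaker (sample IDs are hypothetical).
tmp=$(mktemp -d)
cat > "$tmp/utt2spk" <<'EOF'
spkA_utt1 spkA
spkA_utt2 spkA
spkB_utt1 spkB
EOF

# utt2spk lines are "<utterance-id> <speaker-id>";
# spk2utt lines are "<speaker-id> <utterance-id> <utterance-id> ...".
awk '{utts[$2] = utts[$2] " " $1} END {for (s in utts) print s utts[s]}' \
  "$tmp/utt2spk" | LC_ALL=C sort > "$tmp/spk2utt"
cat "$tmp/spk2utt"
```

In an actual recipe, utils/utt2spk_to_spk2utt.pl remains the tool to use; the sketch only shows the shape of the two files.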
A reference version of local/prepare_data.sh is shown below:
# local/prepare_data.sh
#!/bin/bash
set -e -o pipefail
train_dir=<train-data-path>
eval_dir=NER-Trs-Vol1-Eval
. ./path.sh
. parse_options.sh
for x in $train_dir ; do
if [ ! -d "$x" ] ; then
echo >&2 "The directory $x does not exist"
exit 1
fi
done
if [ -z "$(command -v dos2unix 2>/dev/null)" ]; then
echo "dos2unix not found on PATH. Please install it manually."
exit 1;
fi
# have to remove previous files to avoid filtering speakers according to cmvn.scp and feats.scp
rm -rf data/all data/train data/test data/eval data/local/train
mkdir -p data/all data/train data/test data/eval data/local/train
# make utt2spk, wav.scp and text
# (patterns are quoted so the shell does not expand them; paths are passed as "$1")
find "$train_dir" -name '*.wav' -exec sh -c 'y=$(basename -s .wav "$1"); printf "%s %s\n" "$y" "$y"' _ {} \; | dos2unix > data/all/utt2spk
find "$train_dir" -name '*.wav' -exec sh -c 'y=$(basename -s .wav "$1"); printf "%s %s\n" "$y" "$1"' _ {} \; | dos2unix > data/all/wav.scp
find "$train_dir" -name '*.txt' -exec sh -c 'y=$(basename -s .txt "$1"); printf "%s " "$y"; cat "$1"' _ {} \; | dos2unix > data/all/text
# fix_data_dir.sh fixes common mistakes (unsorted entries in wav.scp,
# duplicate entries and so on). Also, it regenerates the spk2utt from
# utt2spk
utils/fix_data_dir.sh data/all
echo "Preparing train and test data"
# test set: JZ, GJ, KX, YX
grep -E "(JZ|GJ|KX|YX)_" data/all/utt2spk | awk '{print $1}' > data/all/cv.spk
utils/subset_data_dir_tr_cv.sh --cv-spk-list data/all/cv.spk data/all data/train data/test
# for LM training
echo "cp data/train/text data/local/train/text for language model training"
cat data/train/text | awk '{$1=""}1;' | awk '{$1=$1}1;' > data/local/train/text
# preparing EVAL set.
# (patterns are quoted so the shell does not expand them; paths are passed as "$1")
find "$eval_dir" -name '*.wav' -exec sh -c 'y=$(basename -s .wav "$1"); printf "%s %s\n" "$y" "$y"' _ {} \; | dos2unix > data/eval/utt2spk
find "$eval_dir" -name '*.wav' -exec sh -c 'y=$(basename -s .wav "$1"); printf "%s %s\n" "$y" "$1"' _ {} \; | dos2unix > data/eval/wav.scp
find "$eval_dir" -name '*.txt' -exec sh -c 'y=$(basename -s .txt "$1"); printf "%s " "$y"; cat "$1"' _ {} \; | dos2unix > data/eval/text
utils/fix_data_dir.sh data/eval
echo "Data preparation completed."
exit 0;
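After data preparation, it can be useful to confirm that wav.scp, utt2spk, and text refer to the same utterance IDs. The following is a minimal sketch of such a check on made-up sample files. The helper `ids` and the file contents are assumptions for illustration, and Kaldi's own utils/validate_data_dir.sh performs a more thorough validation.

```shell
#!/bin/bash
# Sketch: find utterances that have audio but no transcript (sample data is hypothetical).
d=$(mktemp -d)
printf 'utt1 /path/utt1.wav\nutt2 /path/utt2.wav\n' > "$d/wav.scp"
printf 'utt1 utt1\nutt2 utt2\n' > "$d/utt2spk"
printf 'utt1 hello\n' > "$d/text"

# Print the sorted, unique utterance IDs (first column) of a file.
ids() { awk '{print $1}' "$1" | LC_ALL=C sort -u; }

ids "$d/wav.scp" > "$d/wav.ids"
ids "$d/text" > "$d/text.ids"

# comm -23 keeps the lines that appear only in the first file.
missing=$(LC_ALL=C comm -23 "$d/wav.ids" "$d/text.ids")
echo "utterances without a transcript: $missing"
```

The same comparison can be repeated between wav.scp and utt2spk by swapping the second file.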
Tomorrow we will continue with training the neural network model.